ABSTRACT

This paper analyses language modeling in spoken 
dialogue systems for accessing a database. The use of 
several language models obtained by exploiting dialogue 
predictions gives better results than the use of a single 
model for the whole dialogue interaction. For this reason 
several models have been created, each one for a specific 
system question, such as the request or the confirmation 
of a parameter. 

The use of dialogue-dependent language models increases
performance both at the recognition and at the
understanding level, especially on answers to system
requests. Moreover, other methods for improving
performance, such as automatic clustering of vocabulary
words or the use of better acoustic models during
recognition, do not diminish the improvements given by
dialogue-dependent language models.

The system used in our experiments is Dialogos, the 
Italian spoken dialogue system used for accessing rail- 
way timetable information over the telephone. The ex- 
periments were carried out on a large corpus of dialogues 
collected using Dialogos. 

1. INTRODUCTION 

In a spoken dialogue system (SDS), one way to improve
speech recognition and speech understanding is to use
contextual knowledge as a constraint, both at the
recognition and at the parsing level [?].

Carter [?] shows that clustering the sentences of the
training corpus into subcorpora so as to minimize
entropy improves n-gram based language models. We
propose that a corpus acquired from an SDS should be
split according to the dialogue point at which each
utterance was given. On these subcorpora a set of more
specific n-gram based language models was trained. This
work extends the previous one described in [?], where
first insights into the usefulness of dialogue
predictions were given on a corpus acquired with an
earlier version of the dialogue system, see [?].

Our use of dialogue prediction is similar to the static
prediction described in [?] and is related to the
dialogue-step dependent models in [?], the difference
being that we also measured performance at the
understanding level.

Other methods to improve SDS performance in conjunction
with the use of dialogue predictions were also tested.
The work developed in [?] was exploited and the
vocabulary words (VW) were clustered automatically.
Further improvement was obtained using acoustic models
trained on a larger training set of domain-specific
utterances. It is remarkable that even in those cases
the improvements given by dialogue-dependent language
models were not affected.

2. THE SYSTEM USED FOR THE 
ACQUISITION 

Dialogos is an all-software, completely integrated, di- 
alogue system which runs very close to real-time on a 
DEC Alpha, except for the telephonic interface and text- 
to-speech synthesizer which are run from a PC equipped 
with a D41E Dialogic board. 

The acoustical front-end performs feature extraction
and acoustic-phonetic decoding. The recognition module
is based on frame-synchronous Viterbi decoding, where
the acoustic matching is performed by a phonetic neural
network [?]. The vocabulary of Dialogos contains 3,471
words, clustered in 358 classes. 348 of them contain a
single word, while the remaining 10 classes contain
semantically important words, such as city names (2,983
words), station names (33 words), numbers (76 words),
months, week days, and so on. During recognition, a
class-based bigram language model is used. It was
trained on 30,000 sentences. The training data of the
language model was partially derived from a previous
trial of an SDS applied to the same domain, but for the
most part (86%) it was manually created.
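
As an illustration of the class-based bigram model mentioned above, the following minimal sketch assumes the usual factorization P(w | w_prev) = P(c | c_prev) * P(w | c), with unsmoothed relative-frequency estimates. The function names, the toy sentences and the CITY class are hypothetical; the estimation and smoothing actually used in Dialogos are not described in this paper.

```python
from collections import Counter

def train_class_bigram(sentences, word2class):
    """Collect unsmoothed counts for a class-based bigram model.

    Words missing from word2class form their own single-word class.
    """
    class_bigrams, class_hist = Counter(), Counter()
    word_counts, class_totals = Counter(), Counter()
    for words in sentences:
        classes = ["<s>"] + [word2class.get(w, w) for w in words]
        for prev, cur in zip(classes, classes[1:]):
            class_bigrams[(prev, cur)] += 1
            class_hist[prev] += 1          # how often prev occurs as a history
        for w in words:
            word_counts[w] += 1
            class_totals[word2class.get(w, w)] += 1
    return class_bigrams, class_hist, word_counts, class_totals

def bigram_prob(word, prev_word, model, word2class):
    """P(word | prev_word) = P(class | prev_class) * P(word | class)."""
    class_bigrams, class_hist, word_counts, class_totals = model
    c = word2class.get(word, word)
    c_prev = "<s>" if prev_word is None else word2class.get(prev_word, prev_word)
    if class_hist[c_prev] == 0 or class_totals[c] == 0:
        return 0.0                         # unseen history or unseen class
    p_class = class_bigrams[(c_prev, c)] / class_hist[c_prev]
    p_word = word_counts[word] / class_totals[c]
    return p_class * p_word

# Toy data: two city names share the hypothetical class CITY.
word2class = {"turin": "CITY", "milan": "CITY"}
sentences = [["from", "turin", "to", "milan"],
             ["from", "milan", "to", "turin"]]
model = train_class_bigram(sentences, word2class)
```

Because both city names share one class, a city seen only after "to" still receives probability mass after "from", which is the practical benefit of putting the 2,983 city names into a single class.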

The linguistic processor starts from the best-decoded
sequence and performs a multi-step robust partial
parsing, which is an improvement of [?]. It accepts
partial solutions on the basis of their coherence with
respect to the parser's linguistic knowledge and
generates a task-oriented semantic caseframe.

To interpret a new utterance in the on-going
interaction, the dialogue module (DM) takes into
account the linguistic history and the active focus.
This mechanism allows the DM to identify linguistic
references, find the correct context to apply to
utterance interpretation, and decide if an utterance
causes a shift or restriction of focus [?]. The result
of the contextual interpretation is the choice of a
proper dialogue act (DA), such as the request or the
confirmation of a parameter, and the generation of a
system answer. The DM makes use of pragmatic
expectations about what the user would probably say in
a certain dialogue state. On the basis of these
context-based expectations the DM can generate
predictions.



S1>  Where are you leaving from and going to?
     < request : departure_city, arrival_city >
U1>  From Turin to Milan.
S2>  Are you leaving from Turin for Milan?
     < confirm : departure_city, arrival_city >
U2>  Yes tomorrow at about eight o'clock.
S3>  Do you want to leave tomorrow at about eight
     o'clock?
     < confirm : departure_date, arrival_time >
U3>  Yes.
S4>  I have found two connections ... Do you want
     other information about these connections?
     < confirm : connections_information >
U4>  No thanks.
S5>  Thank you for the call. Good-bye.



Figure 1: Example of a dialogue interaction. 

Using Dialogos, a corpus¹ of nearly 2,000 dialogues,
for a total of 19,697 utterances, was acquired. A
dialogue example is shown in Figure 1, where for each
system sentence (Si>) the DA and the parameters are
given. This information can also be used for predicting
a more specific language model which better represents
the syntactic, semantic, and contextual constraints of
the user's next answer.



3. PREDICTIONS 

A prediction is a guess about a future action, and it
is commonly used to obtain constraints at a certain
point of a dialogue. In an information inquiry system
the knowledge needed to estimate the subset of the
user's DAs already exists. In the VERBMOBIL system, for
instance, a special module estimates the set of DAs in
the next user utterance, and a stochastic recovery is
done when the prediction fails. In our system a certain
point in a dialogue is identified by the question that
the user is replying to, i.e. the DA of the
system-generated sentence, which is called dialogue
prediction (DP) in the following.

At the recognition level, we make use of the informa- 
tion that the DM can provide, by creating specific LMs 
for each DP. The most specific LM is obtained from a 
training-set which only contains replies given in a cer- 
tain DP. However, some questions very rarely appear 
and for them the information contained in the training 
DB is not enough to obtain a robust LM. 
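
The trade-off described above can be sketched as a lookup that falls back to a context-independent model when a DP has too little training data. This is only an illustration: the function name, the dictionary layout and the `min_utterances` threshold are assumptions, not the criterion used in the experiments reported below.

```python
def select_lm(dp, specific_lms, general_lm, train_counts, min_utterances=200):
    """Pick the language model to use for the reply to a given DP.

    specific_lms:  dict mapping a DP label to its specific LM
    train_counts:  dict mapping a DP label to its number of training replies
    min_utterances is a hypothetical robustness threshold.
    """
    enough_data = train_counts.get(dp, 0) >= min_utterances
    if dp in specific_lms and enough_data:
        return specific_lms[dp]
    return general_lm  # rare or unknown DP: use the single general model

# Toy usage with strings standing in for real LM objects.
lms = {"R_t": "lm_request_time", "C_p": "lm_confirm_departure"}
counts = {"R_t": 1500, "C_p": 40}
```

With these toy counts the request for the departure time gets its own model, while the rarely trained confirmation class falls back to the general one.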

3.1. QUESTION CLASSIFICATION 

The system questions were classified in a natural way.
At first they were divided into groups according to the
type of DA: request for (R_i) and confirmation of (C_i)
a parameter i, and listing of train information (Info).
Then these groups were separated into DAs involving
one or more parameters, and, finally, a distinction was
made between the different parameters dealt with by
the questions, such as departure city (p), arrival city
(a), departure time (t), and departure date (d). For
example, C_p is the confirmation of the departure city,
R_t the request of the departure time, and R_p & R_a
the request of both the departure and the arrival cities
through a single sentence. Figure 2 shows the various
classes together with their frequencies of occurrence in
the acquired corpus.

Bearing in mind these distinctions, a specific training
set for each class was obtained. The utterances of a
specific training set include all the instances of the
different user answers at that point of the dialogue;
for instance, the C_p training set contains both
positive and negative confirmations.
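
The construction of the class-specific training sets can be sketched as a simple grouping of user replies by the label of the preceding system question. The triple layout of the corpus and the helper names are hypothetical; only the labelling scheme (R or C plus a parameter letter, Info for listings) follows the classification described above.

```python
from collections import defaultdict

def dp_label(act, params):
    """Build a class label such as 'R_p&R_a' from a DA and its parameters."""
    if act == "info":
        return "Info"
    prefix = {"request": "R", "confirm": "C"}[act]
    return "&".join(f"{prefix}_{p}" for p in params)

def split_by_dp(corpus):
    """Group user replies by the DP of the system question they answer.

    corpus: iterable of (act, params, user_reply) triples.
    """
    subsets = defaultdict(list)
    for act, params, reply in corpus:
        subsets[dp_label(act, params)].append(reply)
    return dict(subsets)

# Toy corpus; positive and negative confirmations land in the same set.
corpus = [
    ("request", ["p", "a"], "from turin to milan"),
    ("confirm", ["p"], "yes"),
    ("confirm", ["p"], "no from turin"),
    ("info", [], "no thanks"),
]
training_sets = split_by_dp(corpus)
```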




Figure 2: Relative frequencies of the classes. 



3.2. CREATION OF THE MODELS 

After obtaining the training-sets for each specific class, 
different models were created with the same algorithm 
used for a single context-independent model. All the re- 
sults presented in this paper were obtained using both 
a bigram model during the acoustic decoding and a tri- 
gram one for the rescoring of the 25 n-best sequences. 
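
The second pass can be pictured with the following sketch, which re-ranks an n-best list by adding a trigram log-probability to the first-pass score. How the scores are actually combined in Dialogos is not stated here, so the additive combination, the `lm_weight` parameter and the probability floor for unseen trigrams are all assumptions made for illustration.

```python
import math

def trigram_logprob(words, trigram_prob, floor=1e-6):
    """Sum of log trigram probabilities; unseen trigrams get a floor value."""
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        p = trigram_prob.get(tuple(padded[i - 2:i + 1]), floor)
        total += math.log(p)
    return total

def rescore(nbest, trigram_prob, lm_weight=1.0):
    """Re-rank hypotheses, best first.

    nbest: list of (words, first_pass_logscore) pairs.
    """
    scored = [(score + lm_weight * trigram_logprob(words, trigram_prob), words)
              for words, score in nbest]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [words for _, words in scored]

# Toy example: the trigram model prefers "yes" after a confirmation question.
trigram_prob = {("<s>", "<s>", "yes"): 0.9, ("<s>", "yes", "</s>"): 0.9}
nbest = [(["yet"], -1.0), (["yes"], -1.2)]
ranked = rescore(nbest, trigram_prob)
```

In the toy list the first pass slightly prefers "yet", but the trigram score moves "yes" to the top.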

4. EXPERIMENTAL RESULTS 

We carried out two sets of experiments using either a
single model for all utterances or a set of specific models
that takes into account the predictions described before.
Both the context-independent and the specialized
models were trained on the same material, 15,575 user
utterances, and tested on 2,040 others. The two sets
were disjoint. Performance is measured at both the
recognition and the understanding level. Recognition
performance is measured in terms of sentence accuracy
(SA) and word accuracy (WA), and understanding
performance in terms of sentence understanding (SU²)
and concept accuracy (CA).



1 A part of this corpus, collected from 493 naive users
(1,363 dialogues, 13,123 utterances), is reported in [?],
where the evaluation results of the system are given.



2 SU is obtained by comparing, for each sentence, the
caseframe generated by the parser with a manually
corrected one. The CA takes into account substitution,
insertion, and deletion of concepts, i.e. attribute-value
pairs in the caseframe. The CA formula is similar to the
WA one, see [13].
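
Since the exact formulas are only referenced, the sketch below assumes the common definition accuracy = 100 * (N - S - D - I) / N, where N is the number of reference items and S, D, I are the substitutions, deletions and insertions of a minimum-cost alignment. Applying the same function to attribute-value pairs treats the concepts of a caseframe as an ordered list, which is a simplification made here for illustration.

```python
def edit_errors(ref, hyp):
    """Minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]

def accuracy(ref, hyp):
    """Accuracy in percent; can be negative when there are many insertions."""
    return 100.0 * (len(ref) - edit_errors(ref, hyp)) / len(ref)

# Word accuracy: one deleted word out of four reference words.
wa = accuracy("from turin to milan".split(), "from turin milan".split())

# Concept accuracy on attribute-value pairs: one substituted value out of two.
ref_concepts = [("departure_city", "turin"), ("arrival_city", "milan")]
hyp_concepts = [("departure_city", "turin"), ("arrival_city", "rome")]
ca = accuracy(ref_concepts, hyp_concepts)
```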



4.1. SINGLE CONTEXT-INDEPENDENT
MODELS

Table 1 compares the performance of the LM used during
the acquisition (baseline) with that of a single
dialogue-independent LM obtained with the whole
training set (ALL_INT). The baseline model was mainly
trained on manually created data, some of which are
unusual in a dialogue interaction, and so this model
shows a poor level of specificity. The ALL_INT model,
on the other hand, is far more specific, as it only
includes utterances that occurred in the user dialogues,
and so it reflects the distribution of the utterances in
a real setting. Both at the recognition and at the
understanding level the ALL_INT model gives a better
performance.

             SA     WA     SU     CA
baseline    69.4   68.8   76.1   66.4
ALL_INT     70.9   71.1   77.6   68.5
ALL_PRED    71.2   73.1   79.4   72.2
FINAL       71.5   73.4   79.8   72.5

Table 1: Results of single models and models with DP.

4.2. LANGUAGE MODELS WITH
DIALOGUE PREDICTIONS

Two models with DP were tested. The first one,
ALL_PRED, was created as described in Section 3.2. The
second one, FINAL, takes for each class the better of
the single model (ALL_INT) and the model with DP
(ALL_PRED), according to the SU metric. For classes
containing few utterances the ALL_INT model was
preferable, for instance in the class "confirmation of
departure city" (C_p), so in this case it was selected.
The results for the models with DP are also given in
Table 1. They show that the use of DP almost doubles
the improvement obtained with the ALL_INT model alone.
The error rate reduction between ALL_INT and FINAL is
near 10% for WA and SU, and over 20% for CA. These
improvements are encouraging because they compare
favorably with the ones reported in [?].

The improvements become clearer if we separate the
test utterances into requests for and confirmations of
a parameter, as shown in Table 2. Through the use of DP
(the FINAL model) a general improvement of 2-4% was
achieved for the request utterances. The improvement
was slightly smaller for the confirmations, because
about 70% of them are utterances of only one word
("Yes", "No", "Okay", and so on), which are always
correctly recognized.

                     SA     WA     SU     CA
request  ALL_INT    60.8   74.6   67.4   60.6
request  FINAL      62.8   78.9   71.3   66.3
confirm  ALL_INT    77.3   71.9   84.6   76.5
confirm  FINAL      76.9   71.3   85.4   78.1

Table 2: Results for requests and confirmations.

5. PREDICTIONS VS. OTHER
IMPROVEMENTS

It is interesting to test whether the performance gain
brought by the use of DP is affected by the use of other
methods. Two methods were tested: the automatic
clustering of vocabulary words (ACVW) and the use of
acoustic models trained on a larger set of
domain-specific utterances.

5.1. LANGUAGE MODELS WITH
AUTOMATIC CLUSTERING OF
VOCABULARY WORDS

Word clustering is commonly used to reduce the number
of parameters of an LM. This can increase the
statistical robustness and reduce the size of the model
itself. At first, most of the classes (348 of 358)
contained a single word, and these classes were
clustered again automatically using a maximum
likelihood method³, as described in [?]. The final
number of classes was 120. Two models, FINAL-clust and
ALL_INT-clust, were trained on the same database as the
FINAL and ALL_INT models described above, but the word
classification was changed from 358 to 120 classes.

5.2. USE OF MORE SPECIFIC ACOUSTIC 
MODELS 

All experimental results so far used an acoustic model
(M1) trained on a set of two DBs. The first is
domain-independent and contains phonetically balanced
data produced by 1,136 speakers: 4,875 utterances (with
an average length of 6 words) and 3,653 isolated words.
The second one is domain-dependent and includes 3,580
utterances (with an average length of 2 words) from 270
speakers. It came from an older SDS acquisition. A new
acoustic model (M2) was created by adding 13,929
utterances (with an average length of 2 words), from
the corpus described in Section 2, to the
domain-dependent DB part of M1.

5.3. FINAL COMPARISON 

Table 3 shows WA and SU results for the LMs with
autoclassification using both the M1 and M2 acoustic
models. Autoclassification alone (M1 columns) improved
both the single model and the DP one, compared to the
results in Table 1 and, as expected, the M2 acoustic
models further improve the recognition and understanding
results. In any case these improvements do not alter
the advantage obtained by the use of DP.

                      WA             SU
                  M1     M2      M1     M2
ALL_INT-clust    71.9   73.8    79.0   81.4
FINAL-clust      73.4   75.6    80.8   83.5

Table 3: Comparison between models with ACVW.

[Diagram; experimental settings on the horizontal axis:
-clust/M1, +clust/M1, +clust/M2]

Figure 3: Error reduction among all the experimental
settings.

3 In [?] several clustering methods were compared through
the perplexity values and they gave similar results. In this
work the choice of the best automatic clustering method was
made experimentally.

The diagram in Figure 3 represents the error rate
reduction values between the ALL_INT and FINAL LMs for
three different experimental settings: without ACVW
using M1 (-clust/M1); with ACVW still using M1
(+clust/M1); and with ACVW but using M2 (+clust/M2).
The diagram shows clearly that in each case the LMs
which use DP give better recognition and understanding
results (all the error rate reduction values are
positive). It is also remarkable that the use of DP, in
conjunction with other methods, could even increase the
improvement. All the values of +clust/M2 are greater
than the -clust/M1 ones: for SU the reduction goes from
10.9% (for -clust/M1) to 12.7% (for +clust/M2), and for
CA from 22.9% to 28.7%. However, in +clust/M1 the error
reduction is the smallest, because ACVW improves above
all the single model (ALL_INT-clust).
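
The error rate reduction itself is not defined in the text. The sketch below assumes the usual relative definition, in which the error rate is 100 minus the accuracy and the reduction is expressed as a percentage of the single-model error. The accuracies in the example are illustrative round numbers, not results from this paper.

```python
def error_rate_reduction(single_acc, dp_acc):
    """Relative error rate reduction in percent.

    Both arguments are accuracies in percent (e.g. WA, SU or CA) obtained
    with the single model and with the DP-based model respectively.
    """
    single_err = 100.0 - single_acc
    dp_err = 100.0 - dp_acc
    return 100.0 * (single_err - dp_err) / single_err

# Illustrative values: raising an accuracy from 80% to 82% removes
# 2 of 20 error points, i.e. about a 10% error rate reduction.
example = error_rate_reduction(80.0, 82.0)
```

A positive value means that the DP-based model makes fewer errors than the single model, which is the sense in which all values in the diagram are positive.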

6. CONCLUSIONS 

It has been shown that more specific models (created
exclusively with replies given at a certain point of
the dialogue) globally improve the performance of an
SDS. On the other hand, in some cases the specific
models are not robust enough (e.g. for very rare but
appropriate utterances). The trade-off between
specificity and robustness should be studied further in
the future. The improvement of the performance for
requests suggests a proportional general improvement of
the whole system, because it implies a higher number of
positive replies to the following confirmation and
fewer dialogue turns spent on unnecessary recovery.
Moreover, the use of DP is useful in conjunction with
other methods, such as the autoclassification of
vocabulary words and the use of more specific acoustic
models. This kind of dialogue-dependent LM has already
been integrated into the Dialogos system.